BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data


Abstract

In this paper, we present BlinkDB, a massively parallel, sampling-based approximate query engine for running ad-hoc, interactive SQL queries on large volumes of data. The key insight that BlinkDB builds on is that one can often make reasonable decisions in the absence of perfect answers. For example, reliably detecting a malfunctioning server using a distributed collection of system logs does not require analyzing every request processed by the system. Based on this insight, BlinkDB allows one to trade off query accuracy for response time, enabling interactive queries over massive data by running queries on data samples and presenting results annotated with meaningful error bars. To achieve this, BlinkDB uses two key ideas that differentiate it from previous work in this area: (1) an adaptive optimization framework that builds and maintains a set of multi-dimensional, multi-resolution samples from original data over time, and (2) a dynamic sample selection strategy that selects an appropriately sized sample based on a query's accuracy and/or response time requirements. We have built an open-source version of BlinkDB and validated its effectiveness using the well-known TPC-H benchmark as well as a real-world analytic workload derived from Conviva Inc. Our experiments on a 100-node cluster show that BlinkDB can answer a wide range of queries from a real-world query trace on up to 17 TB of data in less than 2 seconds (over 100x faster than Hive), within an error of 2-10%.
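The two ideas in the abstract (answering from a sample with an error bar, and choosing the sample size from an accuracy target) can be illustrated with a minimal Python sketch. This is not BlinkDB's implementation. It assumes uniform random sampling and a normal-approximation 95% confidence interval, whereas the abstract describes multi-dimensional samples and does not state how its error bars are computed. The names `approx_mean` and `pick_sample_size` are hypothetical and exist only for this illustration.

```python
import math
import random
import statistics


def approx_mean(values, sample_size, z=1.96, rng=None):
    """Estimate the mean of `values` from a uniform random sample.

    Returns (estimate, half_width), where half_width is the half-width of
    an approximate 95% confidence interval (normal approximation).
    Assumption: uniform sampling without replacement, not BlinkDB's scheme.
    """
    rng = rng or random.Random()
    sample = rng.sample(values, sample_size)
    estimate = statistics.fmean(sample)
    std_error = statistics.stdev(sample) / math.sqrt(sample_size)
    return estimate, z * std_error


def pick_sample_size(values, sizes, max_rel_error, rng=None):
    """Try candidate sample sizes from smallest to largest.

    Returns (size, estimate, half_width) for the first size whose estimated
    relative error is within `max_rel_error`. If none qualifies, the result
    for the largest size is returned. The list of sizes stands in for
    "multi-resolution" samples in a loose, illustrative sense only.
    """
    result = None
    for n in sorted(sizes):
        estimate, half_width = approx_mean(values, n, rng=rng)
        result = (n, estimate, half_width)
        if estimate != 0 and half_width / abs(estimate) <= max_rel_error:
            break
    return result


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic "request latency" data; purely illustrative, not from the paper.
    latencies = [rng.gauss(100.0, 20.0) for _ in range(200_000)]
    n, est, half = pick_sample_size(latencies, [100, 1_000, 10_000], 0.01, rng=rng)
    print(f"sample size {n}: mean = {est:.2f} +/- {half:.2f}")
```

A looser error target lets the sketch stop at a smaller sample, which is the accuracy-for-latency trade the abstract describes. A real engine would draw on samples built ahead of time instead of resampling per query.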
